featauR:
Automated Feature Selection for Machine Learning Algorithms
We are a consulting company for data science, machine learning and statistics with offices in Frankfurt, Zurich and Stuttgart. We support our customers in the development and implementation of data science and machine learning solutions.
Data science projects often follow a similar structure. At the very beginning, one must load and prepare the data, of course. Everything afterwards is the fun part; those first two steps are not.
Feature selection is one of the most fundamental tasks in the data science workflow.
Currently, there are two main ways to select the relevant features out of the entire feature space: filter methods, which rank features by a statistical criterion computed independently of any model, and wrapper methods, which evaluate candidate feature subsets by actually fitting models on them.
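To make the filter idea concrete, here is a minimal, generic sketch of a correlation filter in base R. This is an illustration of the general approach, not the implementation behind bounceR's featureFiltering; the data and cutoff are made up for the example.

```r
# Generic correlation filter: rank features by absolute correlation with the
# target and keep the top k. Purely illustrative, simulated data.
set.seed(42)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X$x1 + 0.5 * X$x2 + rnorm(n)  # x3 carries no signal

cors <- sapply(X, function(col) abs(cor(col, y)))
top_features <- names(sort(cors, decreasing = TRUE))[1:2]
top_features  # the informative features typically outrank the noise feature
```

Filters like this are fast and model-agnostic, but they score each feature in isolation and can miss interactions, which is exactly where wrapper-style approaches come in.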
Componentwise gradient boosting is a boosting ensemble algorithm that can discriminate between relevant and irrelevant features. In essence, each iteration fits one simple base-learner per feature to the current residuals and updates the model with only the best-performing one, so features that are never picked drop out of the model entirely.
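The mechanics described above can be sketched in a few lines of base R. This is a simplified L2-boosting toy with linear base-learners and simulated data, assuming squared-error loss; for a full-featured implementation see the mboost package.

```r
# Toy componentwise L2-boosting: each iteration, fit one univariate
# least-squares base-learner per feature to the residuals and step only
# in the direction of the winner.
set.seed(1)
n <- 300; p <- 5
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 1.5 * X[, "x1"] - 2 * X[, "x3"] + rnorm(n)  # only x1 and x3 matter

mstop <- 100; nu <- 0.1        # iterations and learning rate
f <- rep(0, n)                 # current ensemble fit
selected <- integer(0)         # winning base-learner per iteration

for (m in seq_len(mstop)) {
  u <- y - f                   # residuals = negative gradient of squared loss
  rss <- apply(X, 2, function(x) {
    beta <- sum(x * u) / sum(x^2)
    sum((u - beta * x)^2)
  })
  j <- which.min(rss)          # best-fitting base-learner this round
  beta_j <- sum(X[, j] * u) / sum(X[, j]^2)
  f <- f + nu * beta_j * X[, j]
  selected <- c(selected, j)
}

table(colnames(X)[selected])   # selection counts concentrate on x1 and x3
```

Because irrelevant features rarely win an iteration, their selection frequency stays near zero, which is what makes the method usable for feature selection.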
The package is still under development and not yet listed on CRAN. However, you can get it from GitHub.
# install and load devtools
install.packages("devtools")
library(devtools)
# download from our public repo
devtools::install_github("STATWORX/bounceR")
# load it
library(bounceR)
If you find any bugs or spot anything that is not super convenient, just open an issue.
The package contains a variety of useful functions surrounding the topic of feature selection, such as:
- sim_data: a function simulating regression and classification data, where the true feature space is known
- featureFiltering: a function implementing several popular filter methods for feature selection
- featureSelection: a function implementing our homegrown algorithm for feature selection
- print.sel_obj: an S4 printing method for the object class "sel_obj"
- plot.sel_obj: an S4 plotting method for the object class "sel_obj"
- summary.sel_obj: an S4 summary method for the object class "sel_obj"
- builder: a method to extract a formula with n features from a "sel_obj"

Each round, a random feature importance distribution is initialized. Over the course of \( m \) models, the distribution is adjusted based on how the sampled features perform.
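The reward/penalty idea behind that distribution update can be sketched as follows. This is a hedged illustration of the scheme described above, not bounceR's actual internals: the subset size, the benchmark rule, and the multiplicative update are assumptions for the example, while the reward and penalty values mirror the selectionControl() defaults shown later.

```r
# Illustrative sampling-distribution update: features appearing in
# well-performing models get rewarded, features in poor models penalized.
set.seed(7)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * X[, "x1"] + 1.5 * X[, "x2"] + rnorm(n)  # only x1, x2 matter

n_mods <- 300; subset_size <- 3
reward <- 0.2; penalty <- 0.3   # cf. selectionControl()
w <- rep(1, p)                  # uniform initial sampling weights
perf <- numeric(n_mods)

for (i in seq_len(n_mods)) {
  idx <- sample(p, subset_size, prob = w / sum(w))  # draw a feature subset
  fit <- lm(y ~ X[, idx])
  perf[i] <- summary(fit)$adj.r.squared
  benchmark <- median(perf[1:i])                    # running benchmark
  if (perf[i] >= benchmark) {
    w[idx] <- w[idx] * (1 + reward)                 # reward good subsets
  } else {
    w[idx] <- w[idx] * (1 - penalty)                # penalize bad subsets
  }
}

round(w / sum(w), 3)  # informative features accumulate sampling weight
</imports>
```

Over many models, the weights of features that keep showing up in strong models grow, which is the "adjusted distribution" referred to above.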
Essentially, we take bits from proven algorithms and put them together. For one, we leverage the complete randomness of random forests. Additionally, we apply a loosely adapted idea of backpropagation.
Sure, there are a lot of tuning parameters, but we put them all together in a nice and handy little interface. By the way, we set the defaults based on several simulation studies, so you can - sort of - trust them.
# Feature Selection using bounceR ---------------------------------------------
selection <- featureSelection(data = train_df,
                              target = "target",
                              index = NULL,
                              selection = selectionControl(n_rounds = 100,
                                                           n_mods = 1000,
                                                           p = NULL,
                                                           reward = 0.2,
                                                           penalty = 0.3,
                                                           max_features = NULL),
                              bootstrap = "regular",
                              boosting = boostingControl(mstop = 100, nu = 0.1),
                              early_stopping = "aic",
                              n_cores = 6)
If you have any questions, are interested, or have an idea, just contact us!